单像人类的重新构成旨在通过将输入图像分解为反照率,形状和照明,以在新的照明条件下重新确定目标人。尽管可以实现合理的重新确定结果,但以前的方法均遭受反照率和照明之间的纠缠以及缺乏硬阴影的纠缠,这大大降低了现实主义。为了解决这两个问题,我们提出了一个几何学意识到的单像人类重心框架,该框架利用单位图几何重建来共同部署传统的图形渲染和神经渲染技术。对于脱光灯,我们探索了UNET架构的缺点,并提出了修改后的HRNET,从而在反照率和照明之间获得了更好的分解。为了获得重新,我们引入了一个基于射线跟踪的每个像素照明表示形式,该表示明确地对高频阴影进行了建模,并提出了一个基于学习的阴影修补模块,以恢复来自射线追踪的阴影图的逼真的逼真的阴影(包括硬铸造阴影)。我们的框架能够生成照片逼真的高频阴影,例如在挑战性的照明条件下铸造阴影。广泛的实验表明,我们提出的方法在合成图像和真实图像上都优于先前的方法。
translated by 谷歌翻译
为了解决由单眼人类体积捕获中部分观察结果引起的不足问题,我们提出了Avatarcap,这是一个新颖的框架,该框架将可动画的化身引入了可见和不可见区域中高保真重建的捕获管道中。我们的方法首先为该主题创建一个可动画化的化身,从少量(〜20)的3D扫描作为先验。然后给出了该主题的单眼RGB视频,我们的方法集成了图像观察和头像先验的信息,因此无论可见性如何,都会重新构建具有动态细节的高保真3D纹理模型。为了学习有效的头像,仅从少数样品中捕获体积捕获,我们提出了GeoteXavatar,该地理Xavatar利用几何和纹理监督以分解的隐式方式限制了姿势依赖性动力学。进一步提出了一种涉及规范正常融合和重建网络的头像条件的体积捕获方法,以在观察到的区域和无形区域中整合图像观测和化身动力学,以整合图像观测和头像动力学。总体而言,我们的方法可以通过详细的和姿势依赖性动力学实现单眼人体体积捕获,并且实验表明我们的方法优于最新的最新状态。代码可在https://github.com/lizhe00/avatarcap上找到。
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL) yet been criticized for learning inefficiency. We believe the insufficient utilization of training signals should be responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch with the disjoint regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooting in orthogonal perspectives for training efficiency improvement, DM and JD cooperatively accelerate the training convergence yet not sacrificing the model generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) to report competitive performance. With JD, our DMJD clearly improves the linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks like semantic segmentation, object detection, etc., our DMJD also presents superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
translated by 谷歌翻译
Cohn and Umans proposed a framework for developing fast matrix multiplication algorithms based on the embedding computation in certain groups algebras. In subsequent work with Kleinberg and Szegedy, they connected this to the search for combinatorial objects called strong uniquely solvable puzzles (strong USPs). We begin a systematic computer-aided search for these objects. We develop and implement constraint-based algorithms build on reductions to $\mathrm{SAT}$ and $\mathrm{IP}$ to verify that puzzles are strong USPs, and to search for large strong USPs. We produce tight bounds on the maximum size of a strong USP for width $k \le 5$, construct puzzles of small width that are larger than previous work, and improve the upper bounds on strong USP size for $k \le 12$. Although our work only deals with puzzles of small-constant width, the strong USPs we find imply matrix multiplication algorithms that run in $O(n^\omega)$ time with exponent $\omega \le 2.66$. While our algorithms do not beat the fastest algorithms, our work provides evidence and, perhaps, a path to finding families of strong USPs that imply matrix multiplication algorithms that are more efficient than those currently known.
translated by 谷歌翻译
This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. This algorithm is based on a reduced-space branch and bound scheme and guarantees convergence to the global optimum in a finite number of steps by only branching on the regions of centers. To improve efficiency, we have designed a two-stage decomposable lower bound, the solution of which can be derived in a closed form. In addition, we also propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets have demonstrated that our algorithm can solve the K-center problems to global optimal within 4 hours for ten million samples in the serial mode and one billion samples in the parallel mode. Moreover, compared with the state-of-the-art heuristic methods, the global optimum obtained by our algorithm can averagely reduce the objective function by 25.8% on all the synthetic and real-world datasets.
translated by 谷歌翻译
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., temporal. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
translated by 谷歌翻译
Despite excellent performance in image generation, Generative Adversarial Networks (GANs) are notorious for its requirements of enormous storage and intensive computation. As an awesome ''performance maker'', knowledge distillation is demonstrated to be particularly efficacious in exploring low-priced GANs. In this paper, we investigate the irreplaceability of teacher discriminator and present an inventive discriminator-cooperated distillation, abbreviated as DCD, towards refining better feature maps from the generator. In contrast to conventional pixel-to-pixel match methods in feature map distillation, our DCD utilizes teacher discriminator as a transformation to drive intermediate results of the student generator to be perceptually close to corresponding outputs of the teacher generator. Furthermore, in order to mitigate mode collapse in GAN compression, we construct a collaborative adversarial training paradigm where the teacher discriminator is from scratch established to co-train with student generator in company with our DCD. Our DCD shows superior results compared with existing GAN compression methods. For instance, after reducing over 40x MACs and 80x parameters of CycleGAN, we well decrease FID metric from 61.53 to 48.24 while the current SoTA method merely has 51.92. This work's source code has been made accessible at https://github.com/poopit/DCD-official.
translated by 谷歌翻译
The task of referring video object segmentation aims to segment the object in the frames of a given video to which the referring expressions refer. Previous methods adopt multi-stage approach and design complex pipelines to obtain promising results. Recently, the end-to-end method based on Transformer has proved its superiority. In this work, we draw on the advantages of the above methods to provide a simple and effective pipeline for RVOS. Firstly, We improve the state-of-the-art one-stage method ReferFormer to obtain mask sequences that are strongly correlated with language descriptions. Secondly, based on a reliable and high-quality keyframe, we leverage the superior performance of video object segmentation model to further enhance the quality and temporal consistency of the mask results. Our single model reaches 70.3 J &F on the Referring Youtube-VOS validation set and 63.0 on the test set. After ensemble, we achieve 64.1 on the final leaderboard, ranking 1st place on CVPR2022 Referring Youtube-VOS challenge. Code will be available at https://github.com/Zhiweihhh/cvpr2022-rvos-challenge.git.
translated by 谷歌翻译
Referring image segmentation aims to segment the target object described by a given natural language expression. Typically, referring expressions contain complex relationships between the target and its surrounding objects. The main challenge of this task is to understand the visual and linguistic content simultaneously and to find the referred object accurately among all instances in the image. Currently, the most effective way to solve the above problem is to obtain aligned multi-modal features by computing the correlation between visual and linguistic feature modalities under the supervision of the ground-truth mask. However, existing paradigms have difficulty in thoroughly understanding visual and linguistic content due to the inability to perceive information directly about surrounding objects that refer to the target. This prevents them from learning aligned multi-modal features, which leads to inaccurate segmentation. To address this issue, we present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features by guiding the interaction between vision and language through prior position information. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment by comparing the features of the referred object with those of related objects. Extensive experiments on three benchmarks demonstrate our PCAN performs favorably against the state-of-the-art methods. Our code will be made publicly available.
translated by 谷歌翻译